Abstract


Complexity,  customization,  and  packaging  of  military  platforms  and  systems 
increase  maintenance  difficulty  at  the  same  time  as  the  available  pool  of  skilled  technical 
personnel  may  be  shrinking.  In  this  environment  maintenance  training,  technical  order 
presentation,  and  flight-line  operational  practice  may  need  to  adopt  “just-in-time” 
procedural  aids.  Moreover,  the  realities  of  real-world  maintenance  may  not  permit  the 
hardware  indulgences  and  rigid  controls  of  laboratory  settings  for  visualization  and 
training  systems,  and  at  the  same  time  the  actual  activities  of  maintainers  will  challenge 
requirements  for  portable  or  wearable  devices.  This  project  has  investigated  technologies 
that  may  be  used  by  Air  Force  maintainers  for  training  or  job  aids. 

There are several modalities available for the conveyance of maintenance
information, including text, diagrams, images, speech, video, three-dimensional models
and environments, and live demonstrations. Currently most stored maintenance
information is conveyed through text and diagrams. For this project we investigated the
feasibility of using more advanced technology such as head-mounted displays (HMD),
fusion trackers, wearable computers, unique input devices, and AR software. We
experimented with merging many of the available modalities while concentrating on the
feasibility of using state-of-the-art AR hardware. We also considered authoring systems
for instructions and graphical aids that could address a wide range of possible output
devices. We deemed speech input or output systems inappropriate for flight-line
maintenance  application  due  to  the  high  noise  level  in  the  environment  and  the  relatively 
poor performance of such software. Our main focus was therefore on AR and instruction
authoring. Locating the user’s pose and view relative to the maintained part and
determining the user’s hand grasps were necessary components for understanding what
task the user was engaged in or trying to execute. From demonstration systems we
formulated  a  roadmap  for  re-usable  instruction  authoring  systems  and  derived  a  set  of 
novel  requirements  that  should  be  met  to  make  AR  systems  viable  for  training  or  job  aids. 
1 Overview


This  project  involved  the  investigation  of  Augmented  Reality  (AR)  devices  for 
maintenance  instruction  presentation  for  training  and  job  aids.  We  developed  some 
prototype  systems  to  demonstrate  AR  potentials  and  examine  the  process  of  authoring 
instructions  suitable  for  AR  or  other  presentation  modalities.  Our  ultimate  purpose  was 
therefore  to  propose,  design,  and  evaluate  novel  approaches  to  maintenance  task 
acquisition,  presentation,  and  validation.  Rather  than  take  a  formal  human  factors 
approach  to  analyzing  demonstration  systems,  we  determined  that  requirements  analysis 
was  sufficient  to  provide  a  viable  but  conservative  roadmap  for  future  AR  and  instruction 
authoring  systems  for  Air  Force  maintenance  applications. 

The first component of our study was to integrate AR technologies with electronic
instructions. While we understand the Air Force’s use of Technical Orders and
Interactive Electronic Technical Manuals (IETMs), we have not yet attempted any
integration with those media. Likewise, the exigencies of real vehicle and system shape
complexity  were  considered  in  formulating  the  final  recommendations  and  future 
roadmap,  rather  than  being  directly  incorporated  in  the  demonstration  system.  We  did, 
however,  allow  for  the  incorporation  and  use  of  3D  digital  models  in  future  AR  training 
and  job  aid  systems. 

The  second  component  in  our  study  had  two  subparts:  (1)  to  develop  concepts  for 
and  a  prototype  of  an  AR  or  Virtual  Reality  (VR)  system  that  enables  virtual  training, 
embedded  operations  (job  aids),  and  task  animations  with  a  consistent  user  interface;  and 
(2)  to  sketch  the  design  of  authoring  tools  that  would  allow  task  instructions  and 
animations  to  be  readily  constructed  without  adding  a  requirement  that  the  instruction 
authors  be  talented  animation  artists.  This  component  led  us  to  design  and  prototype  a 
system architecture called DELITED (“Describing, Envisioning and Learning
Instructions Through Expert Demonstrations”) that supports these requirements.

The  third  component  in  our  study  was  to  informally  evaluate  the  task  effectiveness  of 
AR  configurations  for  maintenance  activities  and  human  factors  surrounding  the 
suitability  and  wearability  of  AR/VR  devices.  We  focused  less  on  this  component 
because  we  gathered  enough  experience  with  the  devices  and  our  prototype  systems  that 
we  felt  it  was  better  to  recommend  a  roadmap  for  the  future  than  to  spend  time  analyzing 
substandard  solutions. 

This report loosely follows the outline above. In Section 2 we describe
AR  and  VR  systems  and  issues,  including  the  hardware  components  utilized  for  this 
project.  Section  3  describes  the  DELITED  architecture.  Section  4  evaluates  our 
experiences  and  provides  a  roadmap  for  future  directions  in  AR  for  maintenance 
applications. 

2  Augmented  and  Virtual  Reality 

An  Augmented  Reality  (AR)  system  generates  for  the  user  a  composite  view  by 
superimposing  virtual  information  onto  a  real  scene  with  the  goal  of  helping  the  viewer  to 
better  understand  the  environment.  The  superimposed  information  can  be  text,  such  as 
instructions,  or  a  virtual  scene.  Regardless  of  the  information  type,  it  needs  to  be 
displayed correctly, at the right time and in the right place, to present the user with a
single, unified visual scene.

There  are  two  main  methods  for  displaying  the  AR  scene  for  a  user: 

(1)  Video  based  see-through  Augmented  Reality 

In  this  method,  the  real  scene  is  captured  by  video  cameras,  and  then  merged  with 
the  virtual  scene  generated  by  a  computer.  The  real  and  virtual  scenes  are  integrated 
before  the  user  sees  them.  This  means  that  the  user  will  be  viewing  the  real  scene 
indirectly  through  the  video  camera  recording.  A  diagram  of  a  video  see-through  system 
is  shown  in  Figure  1. 


Figure 1 Video See-through AR (Vallino 2006).


(2)  Optical  based  see-through  Augmented  Reality 

In  an  optical  based  see-through  AR  system,  the  user  views  the  real  scene  directly, 
and  the  virtual  scene  is  optically  merged  directly  in  the  user's  view,  as  shown  in  Figure  2. 
This  optical  merging  can  be  done  through  the  use  of  head-mounted  displays  or  other 
projection  devices. 


Figure  2  Optical  See-through  AR  (Vallino  2006). 

The greatest difference between AR and Virtual Reality (VR) is that the user in an
AR  system  can  simultaneously  observe  a  superimposed  virtual  and  real  scene.  The  user 
in  VR  only  views  a  virtual  scene.  VR  strives  for  a  totally  immersive  environment,  while 
AR  tries  to  merge  the  real  world  scene  with  a  virtual  scene  while  maintaining  the  user’s 
sense  of  presence  in  the  real  world. 

2.1  Main  Challenges 

The first challenge to building a successful AR system is to find a mechanism to
display the real and virtual scenes at the same time, so that the virtual scene and real
scene are seamlessly blended together. The video-based and optical-based see-through
methods shown in Figure 1 and Figure 2 are two basic solutions to this problem, but
many open issues remain: e.g., how to give the two scenes the same perceptual
brightness, and how to manage relative depth (displaying real objects over virtual
ones and virtual objects over real objects).
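
The relative depth issue can be illustrated with a per-pixel depth test. The following is a minimal sketch in Python, not part of the demonstration system. It assumes that a depth value is available for every real pixel (which a plain video camera does not provide) and uses hypothetical names throughout.

```python
def composite(real_rgb, real_depth, virt_rgb, virt_depth):
    """Merge a real and a virtual image, showing whichever surface is nearer.

    Images are lists of rows of RGB tuples; depths are lists of rows of
    distances from the eye. A virtual depth of None means "no virtual
    content at this pixel" (an assumed convention).
    """
    merged = []
    for r_row, rd_row, v_row, vd_row in zip(real_rgb, real_depth,
                                            virt_rgb, virt_depth):
        out_row = []
        for r_px, r_d, v_px, v_d in zip(r_row, rd_row, v_row, vd_row):
            if v_d is not None and v_d < r_d:
                out_row.append(v_px)   # virtual object occludes the real one
            else:
                out_row.append(r_px)   # real object occludes, or nothing virtual
        merged.append(out_row)
    return merged
```

In a video see-through system this test could run before the merged image is shown. In an optical see-through system the real scene cannot be replaced pixel by pixel, which is one reason occlusion is listed as a display challenge.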

The  second  challenge  is  registration  (tracking).  In  fact,  the  registration  problem 
also  exists  in  VR  and  film  special  effects,  so  it  is  not  unique  to  AR  systems,  but  the 
requirements  of  accuracy  and  real-time  performance  of  AR  make  it  more  difficult. 

For  an  immersive  VR  system,  registration  is  also  required  so  that  changes  in  the 
rendered  scene  match  with  the  perceptions  of  the  user,  because  errors  here  will  cause 
conflicts  between  the  visual  system  and  the  kinesthetic  or  proprioceptive  (orientation) 
systems.  Because  visual  perception  always  dominates  our  other  sensory  perceptions,  a 
user in a VR system can accept or adjust to a visual stimulus that overrides
discrepancies with input from the other sensory systems. In fact, the lack of coordination between
the  visual  and  vestibular  system  can  be  exploited  to  make  people  in  VR  feel  they  are 
exploring  a  large  space  when  in  fact  they  are  making  continuous  walking  turns  in  a  small 
area  (Razzaque,  Kohn  et  al.  2001).  In  contrast,  errors  of  mis-registration  in  an  AR 
system  are  between  two  visual  stimuli  that  we  are  trying  to  fuse  to  be  seen  as  one  scene. 
AR  systems  are  thus  more  sensitive  to  these  errors. 

Another challenge comes from the real-time performance requirement of AR.
Because  the  real  environment  is  a  true  real-time  environment,  any  delay  or  lag  time  in 
computing  and  displaying  virtual  objects  will  be  more  visible  in  AR  when  they  are 
presented  with  the  real  scene  at  the  same  time.  So  a  successful  AR  system  should  run  as 
fast  as  the  real  environment,  and  have  some  mechanism  to  make  these  two  scenes  run 
synchronously. 
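
To give a sense of scale for the lag problem, the angular misregistration caused by latency can be approximated as head rotation rate multiplied by end-to-end delay. The sketch below is illustrative only: the 50 degrees per second head rate, 100 ms delay, and 0.6 m working distance are assumed values, not measurements from this project.

```python
import math

def latency_misregistration(head_rate_deg_s, latency_s, distance_m):
    """Return (angular error in degrees, apparent offset in metres).

    A virtual object drawn with stale head data lags the real scene by
    roughly rate * latency; at a given viewing distance that angle
    corresponds to a lateral offset of distance * tan(angle).
    """
    angle_deg = head_rate_deg_s * latency_s
    offset_m = distance_m * math.tan(math.radians(angle_deg))
    return angle_deg, offset_m

# Assumed example: moderate head turn, 100 ms total delay, arm's-length work
angle, offset = latency_misregistration(50.0, 0.1, 0.6)
print(f"{angle:.1f} degrees, {offset * 100:.1f} cm")
```

Under these assumptions the overlay trails the real object by about 5 degrees, or roughly 5 cm at arm's length, which would be enough to mislabel adjacent controls on a panel.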

The  main  challenges  for  AR  are  summarized  below: 

(1)  Displays 

a. See through: AR needs see-through displays to show the real and virtual
scene at the same time. But current see-through displays do not have
sufficient brightness, resolution, or field of view, and cannot seamlessly
blend a wide range of real and virtual imagery.

b. Delay: Some display delay in VR may be tolerable, but any mismatch
between the virtual and real scenes will make an AR system fail.

c. Occlusion (Kiyokawa, Kurata et al. 2000): Augmenting a scene need
not only add objects to a real environment; it also has the potential to
remove them (Azuma 1997). To maintain the correct visual relationship
between virtual and real objects, some real objects may need to be blocked.

d.  Parallax  error:  Most  video  see-through  displays  have  a  parallax  error, 
caused  by  the  cameras  being  mounted  away  from  the  true  eye  optical 
axis. 

e.  Fixed  eye  accommodation:  Most  displays  have  fixed  eye 
accommodation  (focusing  the  eyes  at  a  particular  distance). 

f.  Multimodal  display:  Sometimes  AR  requires  mixed  real  and  virtual 
modalities  other  than  just  the  visual  modality,  such  as  sound  or  haptics. 
There  has  been  little  work  in  this  area. 

(2)  Tracking 

Tracking  and  sensing  are  used  to  report  the  locations  of  the  user  and  the 
surrounding  objects  in  the  environment,  which  is  also  the  basis  of  registration.  AR  places 
stringent  real-time  demands  on  trackers  and  sensors  in  three  areas: 

a.  Greater  input  variety  and  bandwidth; 

b.  Higher  accuracy; 

c.  Longer  range. 

(3)  Registration 

One  of  the  most  basic  problems  for  AR  systems  is  the  registration  problem.  An 
AR  system  should  align  its  virtual  and  real  scenes  correctly,  to  make  them  appear  to  be  in 
the  same  space.  AR  again  presents  stringent  real-time  and  positional  accuracy 
requirements. 

(4) Interaction modality

In an AR system, objects are either real or virtual, but virtual objects cannot present
haptic (physical solidity and weight) cues. This discrepancy is a challenge for interactive
AR systems.

(5)  Authoring  and  tools 

Creating the content for an AR environment, including 3D models, text, overlays, and
interactions, is also a challenge. Creating and storing semantic information with the
geometric models would ease this task.

During  the  course  of  this  project  we  examined  both  optical  see-through  AR  and  VR. 


2.2  Head  Mounted  Displays 

For this project we worked with three different types of head-mounted displays: a
true AR display (nVision Datavisor), an opaque display (eMagin Z800 3DVisor), and a
sliver display (MicroOptical SV-9 PC Viewer).

nVision Datavisor

We  were  able  to  borrow  the  Datavisor  from  another  laboratory  here  at  Penn,  but 
normally  this  HMD  can  be  purchased  for  approximately  $25K  (including  the  see-through 
option).  With  the  see-through  option,  this  device  is  capable  of  true  AR.  It  allows  the 
viewer  to  see  virtual  imagery  on  top  of  a  real  scene. 

We were using an older version of this device, but the state-of-the-art version
allows 1280 x 1024 resolution with 80° monocular field of view (FOV) and 120°
maximum horizontal FOV at 180 Hz and 24 bit color. The major drawback of this device
is its size and weight. We feel that it would be much too cumbersome for the flight-line
maintenance application. The version of the hardware that we used for testing was also
not wireless. In addition to the cumbersome HMD, it required a very large control box.


eMagin  Z800  3DVisor 

We purchased this display for approximately $900. It is much less cumbersome,
and therefore more wearable, than the nVision Datavisor. It also includes a
control box, but it is considerably smaller and much more portable.

This  eMagin  display  does  not  afford  AR.  It  is  a  completely  opaque  display  that 
does  support  VR  applications.  This  device  would  be  feasible  for  VR  training 
applications,  but  for  operational  environments,  it  does  not  seem  feasible.  The  maintainer 
would  be  required  to  remove  and  replace  the  device  throughout  the  maintenance  task. 
This  comfortable  device  has  360°  horizontal  FOV  (with  tracking  device),  24  bit  color  at  a 
resolution  of  800  x  600,  and  includes  stereovision.  It  also  weighs  less  than  8  oz.  and  is 
USB-powered. 

MicroOptical SV-9 PC Viewer

This display was purchased for approximately $990. This sliver display can be
mounted on most glasses, including safety glasses. Unlike the eMagin
visor,  it  obstructs  only  a  small  portion  of  the  wearer’s  FOV  and  can  be  easily  flipped  out 
of  the  way  when  not  in  use. 

Like  the  eMagin  display,  it  does  not  afford  true  AR.  It  does,  however,  permit 
simultaneous  viewing  of  both  real  and  virtual  scenes,  though  they  are  not  superimposed. 
It  also  does  not  support  stereovision,  but  out  of  all  of  the  display  devices  this  is  the  most 
wearable  and  practical  for  the  flight-line  maintenance  application.  It  displays  24  bit  color 
at  a  resolution  of  640  x  480  and  60  Hz.  It  can  be  configured  for  the  left  or  right  eye  and 
has  14°  horizontal  FOV.  It  is  battery  powered  with  a  fully  charged  battery  lasting 
approximately  3  hours. 


2.3  Wearable  Computer 

We looked at a few different wearable computers, but settled on the Sony Vaio
U50, because it was recommended by colleagues, is lightweight, has good battery life,
is affordable, is relatively powerful, and runs a standard operating system.


We purchased the Vaio for approximately $2800. The basic specifications include an
Intel Celeron M 900MHz processor, 512MB RAM, 20GB hard drive, 64MB VRAM, 5"
display, 800 x 600 on-screen resolution, and enhanced battery life of 5.5 hours (2.5
standard). It weighs 1.21 lbs. and runs Windows XP. Though we have not yet made use
of it for this project, it also includes a touch screen. One notably missing feature of the
U50 is a microphone port. This missing feature would make voice-activated applications
more difficult, although it is possible to connect microphones through the USB ports.


2.4  Input  Devices 

When  considering  how  a  maintainer  might  interact  with  an  instruction  delivery 
tool,  we  considered  traditional  input  mechanisms  (i.e.,  keyboards  and  mice).  We  feel  that 
these tools are not optimal in the maintenance environment. Focusing on a computer
screen, mouse, and keyboard distracts from the maintenance task. Additionally, the
grimy nature of maintenance is not conducive to these devices. Hence, we decided to
experiment with the use of CyberGloves and hand gestures. The gloves may be worn under
traditional work gloves, which would help to protect them.

The  other  input  device  that  we  tested  is  the  Intersense  Fusion  Tracker.  An 
important  interaction  with  a  virtual  environment  is  synchronized  movement  of  the  eye 
and  the  virtual  camera.  For  a  maintenance  task  this  includes  positioning  the  camera  in 
the  virtual  environment  to  the  same  point  of  view  as  the  maintainer  has  in  the  real  world. 
For  this  purpose,  we  included  the  fusion  tracker  in  our  experimental  demonstration. 
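
Matching the virtual camera to the maintainer's viewpoint amounts to inverting the tracked head pose. The following is a minimal sketch in Python, assuming the tracker yields a position and a camera-to-world rotation matrix and that the camera looks down its negative z axis (the OpenGL convention). The function names are hypothetical and are not part of the tracker's API.

```python
import math

def yaw_rotation(yaw_deg):
    """Camera-to-world rotation for a turn about the vertical (y) axis."""
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def view_matrix(rotation, position):
    """Invert a rigid pose [R | p] to the world-to-camera matrix [R^T | -R^T p]."""
    rt = [[rotation[r][c] for r in range(3)] for c in range(3)]  # transpose of R
    trans = [-sum(rt[i][k] * position[k] for k in range(3)) for i in range(3)]
    return [rt[i] + [trans[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def transform(matrix, point):
    """Apply a 4x4 matrix to a 3D point and return the 3D result."""
    v = (point[0], point[1], point[2], 1.0)
    return [sum(matrix[i][k] * v[k] for k in range(4)) for i in range(3)]

# Head at (1, 2, 3), turned 90 degrees to the left
view = view_matrix(yaw_rotation(90.0), [1.0, 2.0, 3.0])
# A world point 2 units straight ahead of the head lands on the camera's -z axis
print(transform(view, [-1.0, 2.0, 3.0]))
```

A full implementation would also need the fixed offset between the tracker's mounting point and the wearer's eye, which this sketch ignores.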


Immersion  Wireless  CyberGlove 

This  new  wireless  CyberGlove  II  system  provides  22  high-accuracy  joint-angle 
measurements.  It  uses  resistive  bend-sensing  technology  to  transform  hand  and  finger 
motions into real-time digital joint-angle data. Each sensor is extremely thin and flexible,
being virtually undetectable in the lightweight elastic glove. The basic CyberGlove II
system includes one data glove, two batteries, a battery charger, and a USB/Bluetooth
technology adapter with drivers. The CyberGlove has 0.5° resolution and repeatability to
1°. The typical data rate is 100 records per second. Its operating range is within a 30
foot radius of the USB Bluetooth adapter.

Intersense Fusion Tracker IS-1200

This  is  a  wide-area,  wearable,  6-DOF  hybrid  tracking  and  navigation  system 
designed  for  AR  and  mobile  computing  applications.  It  uses  an  inertial  tracker  for 
orientation and an optical sensor for position. Its accuracy is 0.1° in orientation and 3.0
mm in position. Circular data matrix fiducials provide up to 32,000 unique position
references.  The  update  rate  is  180  Hz  and  it  can  be  interfaced  via  Ethernet,  shared 
memory,  USB,  or  RS-232. 


2.5 Demonstration Application

We  designed  a  demonstration  to  test  the  individual  devices  as  well  as  their 
interactions  and  applications  to  instruction  delivery.  We  chose  to  center  our 
demonstration on a piece of hardware that was readily available to us and has some
degree of complexity: our (old) video editing rack.


The demonstration involves a user wearing a display device, CyberGlove, Sony
Vaio, and Fusion Tracker. Instructions are displayed on the wearable viewer, and hand
signals from the CyberGlove allow the user to cycle through the instructions and activate
and deactivate the devices. The Vaio is the central controller for the system, running all
of the necessary software and permitting the user to be entirely untethered. The Fusion
Tracker can be used to track the position and orientation of the user and thereby
customize the view of the virtual rack being displayed. The accompanying video shows a
user wearing the MicroOptical display and CyberGlove to properly set up the video rack
and copy a tape.


While the overall application was straightforward, the knowledge gained about
the devices and issues present in instruction delivery applications was invaluable. Our
first  consideration  was  the  display  devices  and  their  feasibility  in  this  application.  For  the 
most  part,  all  three  of  the  display  devices  were  easy  to  get  working.  All  that  was  required 
was  to  properly  set  the  display  resolution  and  refresh  rate.  The  DataVisor  and  3DVisor 
had additional possible settings. The DataVisor is capable of true AR, allowing the
virtual and real worlds to merge. This would be ideal for this application, facilitating the
highlighting of real objects with virtual designations and information. However, it
quickly  became  apparent  that  the  DataVisor  was  not  well  suited  for  practical  application. 
The  device  is  rather  heavy  and  awkward;  performing  maintenance  instructions  while 
wearing  it  would  be  quite  difficult.  The  3DVisor  is  much  less  cumbersome,  but  it  is  not  a 
see-through  display,  completely  blocking  the  user’s  view  while  it  is  being  worn.  This 
means  that  the  user  would  have  to  remove  it  before  doing  the  maintenance  instruction  and 
replace  it  again  to  get  more  information  about  the  task. 

The compact design of the MicroOptical display makes it the most feasible
display for this application. In addition to its unobtrusive design that can be mounted on
many different types of glasses, this display can easily be flipped out of view entirely. As
with all such display devices, the resolution is low (640 x 480), and fonts, font sizes, and
color need to be carefully chosen to ensure that the user can easily read the information
being displayed. Certain things that are taken for granted when designing large
(workstation display screen) interfaces become a challenge. Font choice is important,
because text needs to be legible at a low resolution. Sans-serif fonts are easier to
read at small sizes, and bold-facing them is helpful. Color is also an important
consideration. On small displays, light text on a dark background is easier to
read than dark text on a light background.

For  our  demo  application  we  constructed  a  simple  GUI  (graphical  user  interface) 
in FLTK (Fast Light Toolkit 2006) that displayed an image of the video rack and allowed
the  user  to  cycle  through  an  instruction  set. 

When  considering  controllers  for  our  application,  we  were  looking  for  another 
unobtrusive  device  that  is  also  easy  to  use.  We  purchased  a  wireless  CyberGlove  which 
is  thin  enough  to  be  worn  under  work  gloves.  Being  wireless  allows  unencumbered 
movement.  We  wrote  a  hand  shape  recognizer  in  C++  using  the  provided  SDK  (software 
developer’s kit). The code recognizes three hand shapes/gestures: an open hand to toggle
activation of the recognition system, ensuring that interaction with the system is
intentional; a fist gesture to move to the next instruction; and a pointing gesture to move
to the previous instruction. All hand shapes must be held for a second for recognition. The
software  system  is  actually  set  up  such  that  any  gestures  can  be  used.  A  GUI  was 
written,  again  in  FLTK,  to  allow  the  user  to  record  individualized  gestures  for  each 
interactive  command.  This  allows  the  user  to  customize  the  interface  to  any  comfortable 
and  memorable  gesture  set. 
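
The interaction logic described above can be sketched as a small state machine. The original recognizer was written in C++ against the CyberGlove SDK; the Python below is only an illustrative reconstruction, with hypothetical class and shape names, that takes already-classified hand shapes as input and does not call the SDK.

```python
HOLD_SECONDS = 1.0  # from the text: a shape must be held for a second

class GestureController:
    """Turns a stream of classified hand shapes into instruction commands."""

    def __init__(self, num_steps):
        self.num_steps = num_steps
        self.step = 0          # index of the instruction being shown
        self.active = False    # recognition starts switched off
        self._shape = None     # shape currently being held, if any
        self._since = 0.0      # time at which that shape was first seen
        self._fired = False    # True once this hold has issued its command

    def update(self, shape, now):
        """Feed one sample: a shape name (or None) and a time in seconds."""
        if shape != self._shape:
            # A new shape (or release) restarts the hold timer
            self._shape, self._since, self._fired = shape, now, False
            return
        if shape is None or self._fired or now - self._since < HOLD_SECONDS:
            return
        self._fired = True  # issue at most one command per hold
        if shape == "open":
            self.active = not self.active
        elif self.active and shape == "fist":
            self.step = min(self.step + 1, self.num_steps - 1)
        elif self.active and shape == "point":
            self.step = max(self.step - 1, 0)
```

At the glove's typical rate of 100 records per second, a one-second hold corresponds to roughly 100 consecutive samples of the same shape. Ignoring the fist and pointing shapes while inactive is what keeps ordinary maintenance hand motions from advancing the instructions.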

At this stage we can visualize instructions and images on a sliver display,
interacting with the application through gestures recognized from CyberGlove input. A
true AR system can additionally take into account the user’s point of view of the scene.
This enables the system to aid the user in identifying parts and states. We used the
Fusion Tracker to track the position and orientation of the user’s head. A mockup of the
scene was then displayed in an OpenGL window. The Fusion Tracker is small and
lightweight. It is easily mounted on a helmet or cap.


2.6  Development  Issues 

During  the  implementation  of  our  demonstration  application  we  encountered  a  few 
issues that needed to be addressed. The viability of the displays is discussed above. In the
end we feel that the MicroOptical display is a viable choice for maintenance
instruction  delivery. 

Overall  the  CyberGlove  is  a  well-designed  and  reliable  device.  The  licensing  of  the 
SDK, however, caused a few problems. When installing the SDK a code is generated.
This code is then emailed to Immersion, which returns another code to be entered in the
authorization software to permanently unlock or authorize the software. In itself, this is
not a bad procedure; however, it only authorizes the software for one user on
the computer where it is installed. Installing the software on another computer,
reinstalling the software on the same computer, or allowing another user on the computer
to use the software requires sending and receiving a new code from Immersion. While
Immersion was very prompt in sending the codes, in our lab setting and particularly on a
team project this authorization procedure was less than ideal.

The  SDK  for  the  CyberGlove  seems  well-developed,  at  least  for  this  application. 
We  extended  a  few  of  the  methods  easily.  Our  hand  shape  recognition  code  for  this 
demonstration  was  not  sophisticated  or  robust.  The  code  includes  a  tolerance  in  the  hand 
shape  comparisons  (between  the  stored  sample  and  the  real-time  hand  shapes).  This 
tolerance  is  specified  in  degrees  for  each  joint  angle.  Some  preliminary  experimentation 
has  shown  that,  optimally,  different  tolerances  are  needed  for  different  people.  A  more 
robust  technique,  perhaps  a  machine  learning  algorithm,  would  correct  this  small 
problem. 
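
The tolerance comparison can be sketched as follows. This is a simplified Python illustration rather than the project's C++ code: it assumes a single tolerance applied to every joint, uses three joint angles instead of the glove's 22 for brevity, and the 10 degree default and template values are hypothetical.

```python
def matches(sample, template, tolerance_deg):
    """True if every joint angle is within the tolerance of the stored shape."""
    return all(abs(s - t) <= tolerance_deg for s, t in zip(sample, template))

def classify(sample, templates, tolerance_deg=10.0):
    """Return the name of the first stored shape the sample matches, else None."""
    for name, template in templates.items():
        if len(template) == len(sample) and matches(sample, template, tolerance_deg):
            return name
    return None

# Hypothetical stored shapes (joint angles in degrees)
templates = {"open": [5.0, 5.0, 5.0], "fist": [85.0, 90.0, 80.0]}
print(classify([80.0, 88.0, 84.0], templates))   # within 10 degrees of "fist"
print(classify([40.0, 45.0, 50.0], templates))   # matches no stored shape
```

Because the tolerance is a parameter, it could be set per user, which is one simple way to accommodate the person-to-person variation noted above short of a learned classifier.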

The  Fusion  Tracker  was  much  more  difficult  to  get  working.  It  is  a  relatively  new 
product  that  was  not  well  documented.  Getting  the  tracker  fully  working  required  several 
lengthy  calls  to  technical  support  and  returning  the  tracker  for  repair  after  a  firmware 
update.  The  next  challenge  was  to  get  the  device  working  through  a  USB  port  instead  of  a 
serial  port.  Using  the  USB  port  provides  the  necessary  power  to  the  tracker,  whereas 
using  a  serial  port  requires  an  AC  power  source  which  would  restrict  movement.  The 
tracker  also  requires  a  fair  amount  of  set  up  for  an  environment,  including  calculating  the 
size  and  positions  of  the  visual  fiducials  and  attaching  them.  Once  this  setup  is  done  for 
an  environment  it  is  not  required  again.  The  hardest  part  of  dealing  with  the  tracker  was 
figuring  out  how  the  data  that  the  tracker  was  giving  us  corresponded  to  our  coordinate 
system.  The  easy  part  was  dealing  with  the  API.  Although  not  every  aspect  has  been 
documented  yet,  it  was  mostly  intuitive.  Once  the  server  code  is  running  on  a  computer, 
there  are  only  a  few  function  calls  needed  to  receive  the  streaming  data. 

When we started this project, the initial idea was to use the tracker to identify
where the user was looking and overlay images corresponding to certain instructions
onto what the user was seeing. However, the see-through display that we had was far too
bulky for that idea to be practical. An alternative to overlaying images would be to use
the tracker to find out where the user is looking, and use a static image appropriate for
that viewpoint. For example, if the user is supposed to press a button on the video rack
but is standing far from the device, the display would show an image of the entire device
while highlighting the general area that the user should be focusing on. As the
user steps closer to the device, it would display a closer view of the video rack.
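
The static-image alternative reduces to choosing an image by the tracked distance between user and device. A minimal sketch in Python follows; the distance thresholds (in feet) and image file names are hypothetical and would need to be chosen per device.

```python
import math

# (maximum distance in feet, image to show); all values are hypothetical
VIEWS = [
    (2.0, "rack_button_closeup.png"),
    (4.0, "rack_panel.png"),
    (math.inf, "rack_full_highlighted.png"),
]

def select_view(user_pos, device_pos, views=VIEWS):
    """Pick the most detailed image whose distance limit covers the user."""
    distance = math.dist(user_pos, device_pos)
    for max_distance, image in views:
        if distance <= max_distance:
            return image
    return None  # only reachable if the table lacks a catch-all entry

print(select_view((0.0, 0.0, 6.0), (0.0, 0.0, 0.0)))  # far away: whole device
print(select_view((0.0, 0.0, 1.5), (0.0, 0.0, 0.0)))  # close: button close-up
```

A practical version would probably add some hysteresis around each threshold so that the image does not flicker when the user stands near a boundary.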

One  of  the  major  constraints  of  the  tracker  is  the  need  for  fiducial  targets.  In 
order to get correct translation and orientation information, there must be at least four
fiducials  in  the  field  of  view  of  the  tracker’s  camera,  and  the  tracker  must  be  within  some 
distance  of  the  targets.  If  this  is  not  the  case,  this  tracker  will  most  likely  start  to  drift, 
meaning  that  data  being  returned  from  the  tracker  states  that  the  tracker  is  slowly  moving 
or  rotating  in  some  random  direction.  The  size  of  the  targets  dictates  how  far  away  the 
tracker  can  be  located  while  still  returning  reliable  data.  Our  setup  consisted  of  a  grid  of 
targets  four  inches  wide  spaced  roughly  two  feet  apart.  This  enabled  us  to  get  consistent 
readings  up  to  seven  feet  away.  In  the  case  of  an  aircraft  hangar  this  might  not  be  a 
viable  solution.  The  targets  would  most  likely  have  to  be  situated  on  the  device  being 
operated  on.  However,  this  may  give  rise  to  problems  while  trying  to  maintain  a  view  of 
at  least  four  fiducials. 
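
One way an application might guard against drift is to accept a new pose only while enough fiducials are in view, and otherwise keep the last trusted pose. The Python sketch below assumes the application can obtain the number of fiducials currently visible; whether and how the tracker's API reports this is not confirmed here, so that input is a hypothetical.

```python
MIN_FIDUCIALS = 4  # from the text: at least four fiducials must be in view

class PoseFilter:
    """Holds the last pose obtained while enough fiducials were visible."""

    def __init__(self):
        self.last_good = None  # no trusted pose until the first valid reading

    def update(self, pose, visible_fiducials):
        """Return the pose to use for display given the latest reading."""
        if visible_fiducials >= MIN_FIDUCIALS:
            self.last_good = pose
        return self.last_good
```

Freezing the view in this way trades responsiveness for stability: the displayed viewpoint stops updating while the targets are obscured, but it does not wander.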

The Sony Vaio that we used was adequate for our demonstration application.
Because it is a small, wearable device it is not very powerful. We question its feasibility
as the program size and complexity increase. Since our purchase of this Vaio, Sony has
discontinued this model, but is producing newer and slightly more powerful
models.


3  The  DELITED  Architecture 

Humans  excel  at  learning  physical  tasks  quickly  when  shown  example  actions  and 
given  minimal  verbal  instruction.  Typical  instruction  presentations  may  include  video  of 
a  specific  task  performance  or  written  text  and  images.  There  are  cost,  effort,  and 
validation  issues  with  these  traditional  media:  they  require  expert  video  or  textual 
instruction  authoring,  the  time  to  produce  useful  instructional  materials  may  exceed  given 
time  constraints,  the  visual  media  may  not  include  crucial  views  or  steps,  and  written 
instructions  may  have  semantic  flaws  or  ambiguities. 

We  are  pursuing  a  new  direction  in  multimedia  instruction  authoring  to  address  and 
attempt  to  ameliorate  all  of  these  problems.  Using  a  subject  matter  expert  (SME)  as  task 
performer,  we  directly  motion  capture  the  SME’s  actions  in  the  context  of  the  actual 
space.  While  executing  the  task,  the  SME  is  also  videotaped  and  audio  recorded  to 
obtain  a  narrative  of  relevant  verbal  instructions,  annotations,  and  comments.  The 
motion  capture  data  is  used  to  create  novel  views  of  the  pre-built  3D  objects  being 
manipulated.  The  audio  stream  is  used  to  help  segment  the  visual  and  motion  capture 
data  into  more  atomic  actions.  These  actions  are  stored  as  parameterized  actions  so  that 
they  can  be  used  flexibly  to  issue  instructions  in  video,  virtual  reality  3D,  or  illustrated 
text.  In  addition,  the  parameterized  actions  are  the  basic  representation  for  re-animation 
of  the  task  with  a  virtual  human  maintainer,  thus  providing  uniform  semantic  execution, 
visualization,  and  verification  of  the  instructions. 

In the last decade or so, Badler and colleagues Bonnie Webber, Mark Steedman,
Martha Palmer, and Aravind Joshi made deep inroads into understanding the nature and
semantics of natural language (NL) instructions for virtual human agents (Badler,
Webber et al. 1990) (Webber, Badler et al. 1995) (Badler, Palmer et al. 1999). The major
outcome of these studies was the Parameterized Action Representation (PAR) for turning
textual instructions into animated behaviors for multiple individual objects and agents
(Badler, Bindiganavale et al. 2000). Much of this work was architectural in nature, and
prototypes  were  constructed  for  domains  such  as  vehicle  maintenance,  checkpoint 
monitoring,  and  even  crowd  behaviors  (Allbeck,  Kipper  et  al.  2002;  Badler,  Erignac  et  al. 
2002). The PAR included an action database (Actionary), an NL parser, and a simple NL
generator. The underlying virtual human actions included a wide range of head, eye,
body, arm (reach), and locomotion behaviors and were extensible via language commands
and “standing orders” through the parameterization inherent to the PAR (Bindiganavale,
Schuler  et  al.  2000). 
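
To make the idea of a parameterized action concrete, the sketch below shows one way such a record and its database might be represented. It is a simplified illustration only: the field names, the `PAR` and `Actionary` classes, and the `instantiate` method are assumptions made for this example and do not reproduce the published PAR definition.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class PAR:
    """Hypothetical, much-simplified parameterized action record."""
    action: str                                            # e.g. "grasp", "turn"
    agent: Optional[str] = None                            # who performs the action
    objects: Tuple[str, ...] = ()                          # parts or tools acted upon
    reach_goal: Optional[Tuple[float, float, float]] = None
    termination: Optional[str] = None                      # when the action is done


class Actionary:
    """Hypothetical action database: uninstantiated templates keyed by name."""

    def __init__(self):
        self._templates = {}

    def add(self, template):
        self._templates[template.action] = template

    def instantiate(self, action, **params):
        """Copy a stored template and bind the supplied parameters."""
        return replace(self._templates[action], **params)
```

Under this reading, the same stored template can be bound to different agents or objects, which is the sense in which parameterization makes the actions reusable.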

While  working  to  fill  the  Actionary  with  PAR  instances,  Rama  Bindiganavale 
attacked  the  PAR  authoring  problem  (Bindiganavale  and  Badler  1998).  Populating  the 
Actionary  by  hand-coding  PARs  was  possible  but  tedious.  We  began  to  develop  tools  for 
learning  PAR  parameters  by  observation.  Using  3D  motion  capture  data,  we  obtained 
movement  exemplars  for  tasks  such  as  lifting  a  box  with  two  hands,  drinking  liquid  from 
a mug, or touching one’s nose. As anyone who works with motion capture data knows, it
requires significant clean-up and retargeting (Gleicher 2001) in order to replay a
motion on a different character, since the target likely has different body segment sizes
and lengths than the original subject. Instead of mapping the source motion directly to
another  character,  Bindiganavale  generated  a  PAR  with  enough  information  to 
characterize  the  salient  features  of  the  captured  action.  Once  stored  as  a  PAR,  it  could  be 
readily  re-executed  (retargeted)  to  a  different  human  figure.  The  interesting  part  of  this  is 
the  nature  of  the  “salient  features”.  Harking  back  to  the  early  methods  Badler  developed, 
we  used  motion  zero-crossings  and  other  motion  change  features  to  segment  the  motion 
capture  into  “chunks”  that  became  path  and  reach  goals  for  the  PAR.  Essentially,  each 
motion  capture  performance  fixed  one  or  more  constraints  that  were  stored  in  a  PAR.  For 
many  simple  tasks,  one  performance  was  sufficient  to  establish  all  the  semantically 
important  constraints:  that  is,  the  action  description  could  be  generalized  from  only  one 
or  two  examples.  This  worked  by  fixing,  in  parallel,  one  or  more  constraints  for  the 
salient  parts  of  the  movement:  thus  one  performance  might  fix  end  effector  reach,  grasp 
and  release  targets  simultaneously.  By  determining  the  constraints  from  the  actual 
performance  we  avoided  both  long  training  sequences  and  explicit  formulation  of  the 
training  objective  function.  Note  that  not  all  human  movements  may  be  learned  so 
quickly;  in  particular,  “expert”  skills  may  require  individual  practice  and  refinement  to 
achieve  targeting  accuracy,  speed,  coordination,  and  so  on.  Also,  other  expert  actions 
with  physically  complex  systems  will  require  significantly  more  complex  manual 
interactions. 
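
As an illustration of segmenting motion at zero-crossings, the sketch below splits a one-dimensional position trace wherever its velocity changes sign. This is a minimal sketch under stated assumptions, not the method as implemented: the function name `segment_at_zero_crossings`, the use of a single coordinate, and the finite-difference velocity estimate are all choices made for this example, and the text does not say which motion signals were actually used.

```python
def segment_at_zero_crossings(positions, dt=1.0, eps=1e-6):
    """Split a 1-D position trace where the velocity changes sign.

    Returns a list of (start_index, end_index) sample pairs, one per chunk.
    Near-zero velocities (|v| < eps) are treated as pauses and ignored.
    """
    # Finite-difference velocity between consecutive samples.
    velocities = [(b - a) / dt for a, b in zip(positions, positions[1:])]

    boundaries = [0]
    prev_sign = 0
    for i, v in enumerate(velocities):
        sign = 0 if abs(v) < eps else (1 if v > 0 else -1)
        if sign == 0:
            continue
        if prev_sign != 0 and sign != prev_sign:
            boundaries.append(i)  # sample i is where the direction reverses
        prev_sign = sign
    boundaries.append(len(positions) - 1)
    return list(zip(boundaries, boundaries[1:]))
```

Each resulting chunk has a start and end sample whose positions could serve as candidate path or reach goals; a real system would combine several joints and additional motion-change features.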

In  parallel  to  our  physical  movement  learning  work  we  were  interested  in 
understanding  what  features  in  human  movement  were  communicatively  meaningful. 
Classic  psychological  studies  of  movement  led  to  gesture  types  such  as  emblems,  beats, 
deictics,  and  metaphorics  (Cassell,  Pelachaud  et  al.  1994).  But  an  alternative  view  was 
presented  by  human  movement  observers,  particularly  as  expressed  by  Laban  Movement 
Analysis (LMA) (Bartenieff and Lewis 1980). The part that interested us was the notion
of  movement  qualities.  Roughly  speaking,  movement  qualities  are  the  adverbs  relative  to 
a  particular  movement  verb.  Significantly,  while  information  may  be  conveyed  by  the 
movement  “verb”,  the  performer’s  attitude  toward  the  matter  is  conveyed  in  the  motion 
qualities.  Consider  the  phrase  “a  threatening  gesture”:  we  don’t  have  any  clue  what  the 
gesture  motion  (verb)  actually  was,  but  we  do  know  its  performance  was  perceived  as 
threatening.  LMA  was  our  inspiration  for  a  motion  quality  representation  we  called 
EMOTE  (Chi,  Costa  et  al.  2000). 

Though  first  designed  as  an  animation  tool  (for  adding  motion  qualities  to  an  existing 
gesture  form),  EMOTE’s  emergent  utility  now  appears  to  be  as  an  intermediate 
representation  between  movement  data  and  communicatively  meaningful  (linguistic) 
terms. For example, in recent psychological research, Ambady has shown that students
who observe “thin slices” in time of a teacher in a non-verbal presentation produce
evaluations that correlate highly with long-term evaluations of the same teachers
(Ambady, Bernieri et al. 2000). Surprisingly, the thin slices can be as short as 2 seconds,
and the evaluations do not correlate with physical attractiveness. The study authors have
subjects score teachers on so-called “molar behaviors”: English words associated with
personality characteristics. But these molar behaviors have no obvious behavioral
(movement) definitions. We postulate that the EMOTE parameters can mediate
between motion (numerical data) and such molar behaviors. Liwei Zhao and Badler
showed that EMOTE parameters can be measured rather reliably in both motion capture
and video stereo vision data of movements performed by professional LMA notators (Zhao and Badler
2005) (Zhao, Lu et al. 2001). The measurements and quality recognition were handled
by trained neural nets. We are working on re-engineering this system to recognize
EMOTE qualities in 3D human motion capture in real time.

A  primary  objective  must  be  to  reproduce  the  salient  features  (constraints)  of  the  task 
and  modify  motion  and  action  qualities  to  suit  context.  Context  will  include  knowledge 
of  objects  as  gauged  by  the  human  instructor’s  own  approach  to  the  task.  Parameterized 
actions  are  the  ultimate  container  for  generalized  but  contextual  movement  information. 
Others  have  done  movement-by-example  (Atkeson  and  Schaal  1997)  (Buchsbaum  and 
Blumberg  2005)  (Siskind  2001)  but  none  with  a  parameterization  suitable  for  linguistic 
(instruction)  connections.  Recent  developments  in  “apprentice  learning”  (Abbeel  and  Ng 
2004)  are  relevant  but  do  not  intrinsically  address  the  communication  of  non-linear 
features  across  physical  and  linguistic  channels.  Thus  DELITED  must  manage  the 
verbalized  instructions  that  almost  always  accompany  the  demonstration  of  physical 
actions.  The  PARs  created  from  the  physical  demonstration  will  have  their  constraints 
and  parameters  learned  from  motion  data  as  well  as  the  verbalizations.  The  two 
communication  channels  will  complement  one  another:  movements  will  indicate 
locations  and  trajectories  while  language  may  indicate  action  type  and  movement 
qualities. PARs also help ensure that actions are not trainer-specific. We may also learn
what  makes  a  better  expert,  depending  on  how  quickly  (say,  by  counting  back-end 
instruction  adjustment  time)  DELITED  gets  the  resulting  action  represented  and 
described  correctly. 

To pull all these threads together requires the integrated design and prototyping of
the DELITED system to generate textually useful and visually validatable instructions.
The  DELITED  environment  would  capture  3D  motion,  digital  video,  and  audio  of  an 
expert  performing  some  maintenance  task  on  a  physical  device.  The  device  will  be 
previously  modeled  as  a  3D  object  with  separable  components  and  manipulable  parts. 
The  initial  position  of  the  device  in  the  motion  capture  space  will  be  known  or  computed, 
but  thereafter  the  parts  will  not  be  separately  tracked  by  computer  vision  or  other  direct 
sensing means (e.g., augmented reality assembly using bar-coded parts (Seligmann,
Feiner  et  al.  1996)).  The  idea  is  to  understand  the  manipulation  sequence  completely 
from  the  expert’s  own  captured  actions  and  speech  utterances.  It  is  a  major  hypothesis 
that  this  can  be  done,  though  we  will  obviously  leave  the  door  open  to  using  computer 
vision  cues  from  the  live  digital  video  feed  in  the  future. 


3.1  Outline  of  DELITED  Components 

An  outline  of  the  DELITED  methodology  follows.  The  general  flow  of  information 
is  illustrated  in  Figure  3. 


[Figure 3: flowchart. Boxes, in order: Capture expert performance via multimedia; Interactively segment and clean up multimedia tracks; Link task segments to PAR database (with a "Re-use" label); Output PARs as instructions for interactive presentation; Present instructions via interactive manuals, AR, or VR.]

Figure 3. DELITED data flow and architecture.

1. We can determine where the expert’s hands are from motion capture and what
handshapes they are in from the CyberGloves. A critical component of the AR
system  is  translating  the  CyberGlove  hand  and  finger  joint  angles  into  grasp 
types.  We  can  determine  grasp  and  release  actions  from  handshape  changes. 
Knowing  what  the  user’s  hands  are  doing  is  necessary  if  we  are  to  understand 
whether  or  not  the  user  is  performing  the  correct  task  action  on  some  objects  in 
the  maintained  environment. 

2.  From  grasp  and  release  actions  and  spatial  proximity  of  object  parts,  we  can  infer 
the objects of the grasp or release. For grasped objects, state depends on whether
another hand remains in contact, the nature of that contact (e.g., holding vs.
grasping), and physics. Having a physical simulation monitoring the situation seems
prudent, if not essential.

3.  From  the  motion  of  grasped  parts  we  can  update  the  3D  model  to  reflect  state 
change.  Note  that  the  model  itself  may  be  under  the  influence  of  the  physical 
simulation. Thus, e.g., we were able to model part integrity failure in the Up-Lock
Hook example we did for the Air Force a number of years ago: removing
one  of  the  attachment  bolts  completely  resulted  in  a  spring  being  deprived  of  its 
pivot  and  retainer,  so  it  fell  out  of  the  Up-Lock  Hook  assembly. 

4.  The  3D  model  of  the  object  and  its  assembly  must  be  created  and  marked-up  with 
suitable  semantic  information  such  as  part  degrees-of-freedom,  attachment  types, 
manipulation sites, etc. Ultimately this is a considerable amount of pre-DELITED
processing,  but  we  need  the  mark-up  until  we  determine  how  to  work  without  it. 

It  is  possible  that  having  the  expert  annotate  the  object  interactively  (verbally  with 
manual  gestures)  is  a  good  compromise  that  maintains  the  DELITED 
methodology:  i.e.,  the  expert  can  point  to  each  part  or  feature  and  say  what  its 
attributes are. These utterances and pointing gestures can be used to tag the
referenced parts with the feature tags. Thus, e.g., “here is the power connection;
it is a bayonet connector”; “here is the hydraulic connector; to remove it turn
counterclockwise, but be sure the pressure is off”; or “this is the handgrip for
removing the unit from the equipment bay.” We did not implement this, but it
appears feasible.

5.  From  the  motion  capture  stream,  we  can  segment  motions  into  actions.  We  did 
this  in  the  DELITED  prototype  by  building  an  interface  on  top  of  Apple 
Quicktime.  The  user  sees  linear  editing  tracks  for  each  input  modality:  motion 
capture,  audio,  video  stream(s).  The  user  can  manually  eliminate  irrelevant  time 
segments,  align  motions  and  verbalizations,  and  save  video  or  motion  capture 
sequences  as  PARs.  A  streaming  video  demonstration  of  this  system  is  available 
as “RIVET-1” (see footnote 1) at http://hms.upenn.edu/RIVET/. From the audio stream we can
transcribe speech to text. We can also find intonation contours and silences that
may be useful in establishing action segmentation. Part noises may also be
used to establish contacts or state changes (e.g., a component being set down on
the ground might be audibly detectable). Pick-up and release of tools or other
objects might be noted in speech, making key connections in parameterized actions
to objects that need not be detected visually.

1 Note that this prototype was created as part of a complementary project with NASA.

6.  Motion  segmentation  of  action  streams  into  PARs  remains  a  crucial  step.  Durell 
Bouchard  is  studying  automated  motion  segmentation  schemes.  There  are  several 
possibilities to pursue: use a single-camera video view to isolate
segmentation events, find robust numerical segmentation rules, or use the
emergence of specific EMOTE parameters to segment actions. We believe this is
eminently realistic, as some preliminary experiments by Liwei Zhao on projecting
the 3D movements into a 2D plane still resulted in useful segmentations and
EMOTE parameter recognition (Zhao and Badler 2005). In addition to the motion
capture data we will have the expert’s audio stream. Knowing that speech
instructions precede gestures, we can use speech breaks as cues to action
segmentation. Our prototype relies on manual segmentation into actions (PARs).

7.  While  the  EMOTE  parameter  set  may  be  overkill  for  this  application,  it  might 
give  insight  into  local  motion  changes  that  are  significant  precursors  and  triggers 
for  segmentable  actions.  For  example,  a  change  in  the  rotation  speed  of  the 
thumb-index finger axis or wrist spatial path may differentiate turning from
loosening or tightening. Posture changes may signal weight shifts that accompany
movement  of  heavy  parts  or  required  torque  application. 

8.  From  head  motion  we  can  establish  a  visual  line  of  regard  and  use  that  to  augment 
information  on  the  reach  target. 

9.  Statements  about  cautions,  warnings,  and  preferred  practice  might  be  included  in 
the  audio  stream.  The  expert  can  be  given  a  written  check-list  of  items  to 
establish  prior  to,  during,  or  after  the  procedure  is  demonstrated.  Some  of  this  can 
even  be  done  on  a  web-based  fill-in  (typed  or  menu)  form:  e.g.,  check  that  power 
is  off,  manage  hazardous  materials  properly,  or  describe  typical  failure  modes. 
There  might  be  instances  where  the  expert  wants  specific  camera  views.  There 
might  be  a  preliminary  list  of  tools  and  extra  parts  needed  and  what  to  do  with 
removed  parts  (save  for  re-use,  stow,  or  dispose).  Some  of  this  information  can 
be  transcribed  directly  or  even  left  in  the  multi-media  (video  and  audio)  part  of 
the  presentation. 

10. The text processing might be facilitated by a pre-built data dictionary of tool,
part, and assembly names. Also, there will be an evolving Actionary of PARs
that describes actions. The names (actions) of these PARs can be used to bias the
speech-to-text transcription toward more probable utterances. Having such a limited
vocabulary  and  allowing  it  to  be  used  by  the  transcribing  software  is  a  feature  to 
look  for  in  commercial  speech  recognizer  software. 

11.  From  the  motion  segmentation,  the  grasp  state  and  the  parts  or  tools  involved, 
PARs  describing  the  action  will  be  hypothesized  and  relevant  parameters  saved. 
Nonlinear and parallel action requirements will come from the motion capture and
may also be cued by the text: e.g., “while holding the nut, insert the bolt into the
hole and hand-tighten.” If one does this carefully, it might be possible to work
with two individuals cooperatively (though the information capture becomes more
complex).

12. Given the motion capture, video, and audio streams, plus a possible text
transcription and preliminary (automatic) segmentation, the entire dataset can be
presented in an interactive tool. Here the user can interactively view and fix the
text and check that any PAR actions are sensible and sensibly labeled. A preliminary
textual version of the instructions based on the PARs and necessary segments of
the uttered narrative can be created immediately and examined. Segments of the
video will be directly attached to the PARs, but the 3D models and their predicted
manipulations can be created and associated with each PAR as well. A useful
feature of this interface is the capability of authoring instructions directly from
PARs without using captured data, by using a drag-and-drop menu interface to
select actions from the PAR Actionary and (possibly) other data
sources. A prototype of this component of DELITED has also been implemented
as “RIVET-2” (see footnote 2) at http://hms.upenn.edu/RIVET/.

13.  An  animation  based  on  the  PARs  can  be  launched  as  a  visual  instruction 
validation.  On  execution  of  a  PAR,  the  3D  model  can  be  separately  manipulated 
for  different  views  by  the  end  user.  Requirements  here  include: 

a.  Need  good  dexterous  hand  model. 

b.  Need  robust  grip  and  manipulation  of  object  parts. 

c.  Need  motion  controllers  for  each  uninstantiated  PAR  type. 

d.  Need  to  delineate  object  forms  being  manipulated,  perhaps  with  complex 
manual  procedures  with  deformable  object  parts.  (E.g.,  stretching  bands 
over  pulleys,  winding  cable  around  spindle,  etc.) 

e.  The  synthesized  animation  can  itself  be  flexibly  viewed  as  training  or  as 
an  operational  aid. 

f.  Re-animate  from  a  variety  of  body  positions  or  orientations  for  individual 
variations  or  microgravity  application. 

14. Other issues that can be investigated and possibly handled:

a.  Semantic  representation  for  3D  parts  and  assemblies. 

b.  Representing  part  failure  modes. 

c.  Representing  broken  parts. 

d.  Representing  deformable,  stretchable,  and/or  flexible  parts. 

e.  Animating  part  motion  in  microgravity.  (Physics  engine) 

f.  Tool  representation,  database,  and  management. 

g.  Repair  components  and  substitutes,  e.g.,  tape,  wire,  solder,  etc.,  in  an 
availability  database  (analogous  to  toolset). 

h. How does one move this whole information acquisition (hardware) set-up
to an in situ site and manage real situations with complex collision
avoidance, accessibility, and reach?

i.  Disassembly  planner  based  on  3D  model  and  semantic  markup? 

j. Determining action conditionals for PAR, such as preparatory
specifications, terminations, etc. Do these come from text and geometry?

k.  Automatic  generation  of  re-assembly  procedure  as  reverse  of  disassembly. 


2 Note that this prototype was also created as part of a complementary project with NASA.

l. Enumeration of necessary tools and repair items, including consumables.

m.  Capability  to  (audio)  record,  save,  and  playback  values,  positions,  state, 
etc.  for  proper  re-assembly  and  operation. 

n.  Physical  and/or  kinematic  simulation  of  modeled  assembly  for  operation, 
manipulation,  testing,  calibration,  etc. 

o.  Physical  presentation  to  end  user  as  VR  or  AR. 

p.  XML  formats  (DTD)  for  presentation  interface. 
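
Steps 1 and 2 of the outline above (detecting grasp and release from handshape changes, then inferring the grasped part from spatial proximity) can be sketched as follows. This is a hedged illustration only: the open/closed classification by mean finger flexion, the 45-degree threshold, the 0.1 distance limit, and all function names are assumptions made for this example. It does not use the CyberGlove SDK, and real grasp-type recognition would distinguish many more handshapes.

```python
import math


def classify_handshape(flexion_deg, closed_threshold=45.0):
    """ASSUMPTION: a hand is 'closed' when mean finger flexion is large."""
    mean_flexion = sum(flexion_deg) / len(flexion_deg)
    return "closed" if mean_flexion >= closed_threshold else "open"


def detect_grasp_events(frames, closed_threshold=45.0):
    """frames: list of (time_s, [finger flexion angles in degrees]).

    Returns [(time_s, 'grasp' | 'release')] at each handshape change.
    """
    events = []
    previous = None
    for time_s, flexion in frames:
        shape = classify_handshape(flexion, closed_threshold)
        if previous == "open" and shape == "closed":
            events.append((time_s, "grasp"))
        elif previous == "closed" and shape == "open":
            events.append((time_s, "release"))
        previous = shape
    return events


def nearest_part(hand_pos, parts, max_dist=0.1):
    """parts: {name: (x, y, z)}. Nearest part within max_dist, else None."""
    best = min(parts.items(), key=lambda kv: math.dist(hand_pos, kv[1]), default=None)
    if best is None or math.dist(hand_pos, best[1]) > max_dist:
        return None
    return best[0]
```

Pairing each detected event with `nearest_part` at the event time yields candidate (action, object) pairs, which is the kind of information a PAR hypothesis in step 11 would need.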

People  learn  how  to  assemble,  disassemble,  and  repair  things  by  reading,  looking, 
and  doing  -  often  all  interleaved.  Maintenance  presents  useful  and  complex  procedural 
issues  with  timing,  manual  skills,  and  sensory  feedback.  Many  of  these  characteristics 
are  multi-channel:  e.g.,  “turn  nut  to  loosen”  expresses  a  basic  action  (turn),  a  temporal 
condition  (constantly),  and  a  termination  condition  based  on  a  change  in  resistance  to 
movement  (a  motion  quality).  The  DELITED  architecture  should  be  able  to  generate 
instructions  for  and  simulate  a  significant  maintenance  task  after  being  shown  the  basic 
manual  skills  and  terminology. 
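
The “turn nut to loosen” example can be read as a repeated basic action with a termination condition on sensed resistance. The sketch below illustrates that reading only: `turn_until_loose`, `SimulatedNut`, the 50% drop ratio, and the step limit are all hypothetical and are not part of DELITED or the PAR implementation.

```python
def turn_until_loose(read_resistance, turn_step, drop_ratio=0.5, max_steps=100):
    """Repeat the basic action until resistance drops markedly.

    Returns the number of turn steps taken, or None if the termination
    condition was never met within max_steps.
    """
    baseline = read_resistance()
    for step in range(1, max_steps + 1):
        turn_step()                                     # basic action: turn
        if read_resistance() <= drop_ratio * baseline:  # termination condition
            return step
    return None


class SimulatedNut:
    """Toy stand-in for a sensed nut: resistance drops after breakaway."""

    def __init__(self, breakaway_turns=3):
        self.turns = 0
        self.breakaway_turns = breakaway_turns

    def resistance(self):
        return 10.0 if self.turns < self.breakaway_turns else 2.0

    def turn(self):
        self.turns += 1
```

For example, `turn_until_loose(nut.resistance, nut.turn)` with a default `SimulatedNut` stops after the third turn, when the simulated resistance falls below half of its initial value.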


4  Evaluation,  Summary,  and  Recommendations 

Throughout  this  project  we  have  been  concerned  with  testing  the  ease  of  use  of  these 
devices,  their  reliability,  and  their  feasibility  in  the  maintenance  domain.  We  are 
concerned that the devices may be too inhibiting for the maintainers, even though they are
relatively  compact.  Ultimately,  it  must  be  determined  if  the  benefit  is  worth  the  cost.  In 
reviewing  the  benefits  of  using  VR/AR  equipment,  we  must  consider  the  ease  of  use  and 
type  and  amount  of  information  that  can  be  conveyed  when  compared  to  existing 
methods  (computer  screens  and  paper).  This  information  must  originate  somewhere  and 
somehow  during  the  instruction  authoring  process. 

We  are  interested  in  and  would  recommend  investigating  novel  instruction  authoring 
systems that would provide additional data to be exploited in a VR/AR instruction
delivery  system.  We  propose  expert  authoring  of  instructions  by  demonstration.  An 
expert  can  be  captured  through  audio,  video,  and  motion  capture  performance  of  a 
maintenance  task.  These  modalities  can  then  be  used  both  in  the  authoring  of  instructions 
and  as  additional  data  during  the  delivery  of  the  instructions.  Over  the  past  several  years 
we  have  been  developing  a  Parameterized  Action  Representation  (PAR)  (Bindiganavale, 
Schuler  et  al.  2000).  We  believe  that  PARs  can  be  used  to  both  recognize  actions  from 
motion  capture  data  (Bindiganavale  and  Badler  1998)  and  fill  in  necessary  semantics  that 
may  not  be  found  directly  in  any  of  the  audio,  video,  or  motion  capture  streams.  This 
expert  authoring  application  is  our  next  challenge. 


5  Selected  Bibliography 

In-text  references  are  listed  in  this  section. 

http://www.augmented-reality.org/ismar/

[115]  On  the  Resources  -  Augmented  Reality, 